Data Visualisation Project¶

By Alexandre COGORDAN & Victor BOCQUIN¶

Our motivation¶

Working on the dataset of car accidents in India provides an opportunity to explore complex dynamics related to road safety in a specific context.

The diversity of variables, such as road conditions, driver characteristics, vehicle details, and accident causes, allows for a deeper understanding of contributing factors to accidents.

Analyzing these data can not only highlight key challenges in road safety but also provide crucial insights to guide targeted preventive initiatives. By understanding collision patterns, profiles of at-risk drivers, and predominant environmental conditions, we could, if this were a real project, contribute to improving road safety policies and perhaps to reducing accidents and creating a safer road environment in India.

Introduction - Understanding our dataset¶

In [ ]:
import pandas as pd
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly
import plotly.express as px

pd.set_option('display.max_columns', None)
plotly.offline.init_notebook_mode()
In [ ]:
df = pd.read_csv('road.csv')
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12174 non-null  object
 14  Types_of_Junction            11429 non-null  object
 15  Road_surface_type            12144 non-null  object
 16  Road_surface_conditions      12316 non-null  object
 17  Light_conditions             12316 non-null  object
 18  Weather_conditions           12316 non-null  object
 19  Type_of_collision            12161 non-null  object
 20  Number_of_vehicles_involved  12316 non-null  int64 
 21  Number_of_casualties         12316 non-null  int64 
 22  Vehicle_movement             12008 non-null  object
 23  Casualty_class               12316 non-null  object
 24  Sex_of_casualty              12316 non-null  object
 25  Age_band_of_casualty         12316 non-null  object
 26  Casualty_severity            12316 non-null  object
 27  Work_of_casuality            9118 non-null   object
 28  Fitness_of_casuality         9681 non-null   object
 29  Pedestrian_movement          12316 non-null  object
 30  Cause_of_accident            12316 non-null  object
 31  Accident_severity            12316 non-null  object
dtypes: int64(2), object(30)
memory usage: 3.0+ MB
In [ ]:
df.head()
Out[ ]:
Time Day_of_week Age_band_of_driver Sex_of_driver Educational_level Vehicle_driver_relation Driving_experience Type_of_vehicle Owner_of_vehicle Service_year_of_vehicle Defect_of_vehicle Area_accident_occured Lanes_or_Medians Road_allignment Types_of_Junction Road_surface_type Road_surface_conditions Light_conditions Weather_conditions Type_of_collision Number_of_vehicles_involved Number_of_casualties Vehicle_movement Casualty_class Sex_of_casualty Age_band_of_casualty Casualty_severity Work_of_casuality Fitness_of_casuality Pedestrian_movement Cause_of_accident Accident_severity
0 17:02:00 Monday 18-30 Male Above high school Employee 1-2yr Automobile Owner Above 10yr No defect Residential areas NaN Tangent road with flat terrain No junction Asphalt roads Dry Daylight Normal Collision with roadside-parked vehicles 2 2 Going straight na na na na NaN NaN Not a Pedestrian Moving Backward Slight Injury
1 17:02:00 Monday 31-50 Male Junior high school Employee Above 10yr Public (> 45 seats) Owner 5-10yrs No defect Office areas Undivided Two way Tangent road with flat terrain No junction Asphalt roads Dry Daylight Normal Vehicle with vehicle collision 2 2 Going straight na na na na NaN NaN Not a Pedestrian Overtaking Slight Injury
2 17:02:00 Monday 18-30 Male Junior high school Employee 1-2yr Lorry (41?100Q) Owner NaN No defect Recreational areas other NaN No junction Asphalt roads Dry Daylight Normal Collision with roadside objects 2 2 Going straight Driver or rider Male 31-50 3 Driver NaN Not a Pedestrian Changing lane to the left Serious Injury
3 1:06:00 Sunday 18-30 Male Junior high school Employee 5-10yr Public (> 45 seats) Governmental NaN No defect Office areas other Tangent road with mild grade and flat terrain Y Shape Earth roads Dry Darkness - lights lit Normal Vehicle with vehicle collision 2 2 Going straight Pedestrian Female 18-30 3 Driver Normal Not a Pedestrian Changing lane to the right Slight Injury
4 1:06:00 Sunday 18-30 Male Junior high school Employee 2-5yr NaN Owner 5-10yrs No defect Industrial areas other Tangent road with flat terrain Y Shape Asphalt roads Dry Darkness - lights lit Normal Vehicle with vehicle collision 2 2 Going straight na na na na NaN NaN Not a Pedestrian Overtaking Slight Injury

We display the percentage of null values per column, then drop every row that contains at least one null value.

In [ ]:
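# Percentage of null values per column (printed explicitly, since only the last expression of a cell is displayed)
print(df.isna().sum() / len(df) * 100)
# Drop every row that contains at least one null value
df = df.dropna()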
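Dropping every incomplete row is a costly choice here: the per-group counts shown for Graph 4 add up to 2,889 rows, against 12,316 in the raw file. The sketch below, on a small made-up table (the values are hypothetical, not taken from road.csv), shows one way to measure that loss and a possible alternative, which is to drop rows only when the columns a given graph needs are missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accident table (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Friday", "Friday", "Sunday"],
    "Defect_of_vehicle": ["No defect", np.nan, np.nan, "No defect"],
    "Service_year_of_vehicle": ["5-10yrs", "Above 10yr", np.nan, "1-2yr"],
})

# Share of missing values per column, in percent, largest first
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Rows kept when every row with at least one missing value is dropped
kept = len(toy.dropna())
print(f"{kept} of {len(toy)} rows kept after a full dropna()")

# Alternative: only require the columns that a given graph actually uses
kept_subset = len(toy.dropna(subset=["Day_of_week"]))
print(f"{kept_subset} of {len(toy)} rows kept for a day-of-week graph")
```

Whether the alternative is appropriate depends on the graph; we kept the full dropna() in this notebook so that every graph uses the same rows.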

Graph 1 - Distribution of accidents by day of the week.¶

Each bar represents the count of accidents on a specific day, with colors distinguishing days. The checkbox filter allows users to explore this distribution based on the gender of drivers, offering insights into how accident patterns vary across different days for selected genders. We can observe that, overall, the number of accidents is highest on Fridays. The same holds when filtering for men. However, when filtering for women, there were more accidents on Mondays.¶
In [ ]:
app = dash.Dash(__name__)


app.layout = html.Div([
    dcc.Graph(id='accidents-by-day'),
    dcc.Checklist(
        id='gender-filter',
        options=[
            {'label': 'Male', 'value': 'Male'},
            {'label': 'Female', 'value': 'Female'},

        ],
        value=['Male', 'Female'],
        labelStyle={'display': 'block'}
    )
])

@app.callback(
    Output('accidents-by-day', 'figure'),
    [Input('gender-filter', 'value')]
)
def update_graph(selected_genders):
    # Keep only the accidents whose driver's gender is ticked in the checklist
    filtered_df = df[df['Sex_of_driver'].isin(selected_genders)]
    fig = px.histogram(filtered_df, x='Day_of_week', color='Day_of_week',
                       category_orders={"Day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]},
                       title='Accidents by Day of the Week',
                       labels={'Day_of_week': 'Day of the Week'})
    return fig


if __name__ == '__main__':
    app.run_server(debug=True)
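The observation about Fridays and Mondays can also be checked without the interactive app. The following is a minimal sketch on a made-up table (hypothetical rows, only the two column names come from our dataset) that counts accidents per day and gender with pd.crosstab and then picks the busiest day for each gender.

```python
import pandas as pd

# Toy stand-in for the accident table (hypothetical rows)
toy = pd.DataFrame({
    "Day_of_week": ["Friday", "Friday", "Monday", "Monday", "Monday", "Friday"],
    "Sex_of_driver": ["Male", "Male", "Female", "Female", "Male", "Female"],
})

# Rows are days, columns are genders, cells are accident counts
day_by_sex = pd.crosstab(toy["Day_of_week"], toy["Sex_of_driver"])
print(day_by_sex)

# For each gender, the day with the highest count
busiest_day = day_by_sex.idxmax()
print(busiest_day)
```

On the real data, the same two calls applied to df would give the figures behind the histogram for each checkbox selection.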

Graph 2 - Distribution of accidents by educational level¶

We thought it would be a good idea to check whether a higher educational level meant a lower accident rate. Seeing the results, it seems true, but this could also be explained by the age and therefore the driving experience of the drivers, a factor we'll look at in later graphs.¶
In [ ]:
fig2 = px.histogram(df, x='Educational_level', title='Distribution of accidents by educational level')
fig2.show()

Graph 3 - Violin plot of Age Distribution by Day of the Week¶

A violin plot illustrating the distribution of driver age bands across the days of the week, providing insights into the age demographics associated with accidents on each day. The days look very similar, but we can notice, for example, that there are more accidents involving 18 to 50 year olds on Friday than on Monday.¶
In [ ]:
fig3 = px.violin(df, x='Day_of_week', y='Age_band_of_driver', title='Age Distribution by Day of the Week')
fig3.show()

Graph 4 - 3D scatter plot of Age band, Sex of driver, and Number of accidents¶

This 3D scatter plot depicts the relationship between driver age bands, gender, and the number of accidents (nb_Accident). The color of the points encodes the number of accidents. We can see that male drivers aged 18 to 50 account for most of the accidents in this dataset.¶
In [ ]:
df2 = df.copy()
df2['nb_Accident'] = 1
df2 = df2.groupby(['Sex_of_driver', 'Age_band_of_driver']).count().reset_index()
df2
Out[ ]:
Sex_of_driver Age_band_of_driver Time Day_of_week Educational_level Vehicle_driver_relation Driving_experience Type_of_vehicle Owner_of_vehicle Service_year_of_vehicle Defect_of_vehicle Area_accident_occured Lanes_or_Medians Road_allignment Types_of_Junction Road_surface_type Road_surface_conditions Light_conditions Weather_conditions Type_of_collision Number_of_vehicles_involved Number_of_casualties Vehicle_movement Casualty_class Sex_of_casualty Age_band_of_casualty Casualty_severity Work_of_casuality Fitness_of_casuality Pedestrian_movement Cause_of_accident Accident_severity nb_Accident
0 Female 18-30 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
1 Female 31-50 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
2 Female Over 51 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
3 Female Under 18 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
4 Female Unknown 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140 140
5 Male 18-30 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987 987
6 Male 31-50 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906
7 Male Over 51 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352 352
8 Male Under 18 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194 194
9 Male Unknown 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249 249
10 Unknown 18-30 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
11 Unknown 31-50 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
12 Unknown Over 51 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
13 Unknown Under 18 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
14 Unknown Unknown 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
In [ ]:
fig4 = px.scatter_3d(df2, x='Age_band_of_driver', y='Sex_of_driver', z='nb_Accident',
                    title='Number of accidents by age band and sex of driver', color='nb_Accident')

fig4.show()
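The grouped table above repeats the same count in every column because count() is applied to each of them. A possible alternative, sketched here on a made-up table (hypothetical rows), is to use size(), which returns a single count per group and makes the helper column of ones unnecessary.

```python
import pandas as pd

# Toy stand-in for the accident table (hypothetical rows)
toy = pd.DataFrame({
    "Sex_of_driver": ["Male", "Male", "Male", "Female"],
    "Age_band_of_driver": ["18-30", "18-30", "31-50", "18-30"],
    "Time": ["17:02:00", "1:06:00", "9:30:00", "12:00:00"],
})

# One row per (gender, age band) pair with a single count column
counts = (
    toy.groupby(["Sex_of_driver", "Age_band_of_driver"])
       .size()
       .reset_index(name="nb_Accident")
)
print(counts)
```

The resulting frame has only the three columns the 3D scatter plot needs.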

Graph 5: Line chart of the number of accidents over time¶

The following line chart shows how the number of accidents varies with the time of day. This can help identify the periods of the day when most accidents happen.¶
In [ ]:
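df3 = df.copy()
df3['nb_Accident'] = 1

# Parse the times so that the groups sort chronologically; as plain strings,
# '17:02:00' would sort before '1:06:00'
df3['Time'] = pd.to_datetime(df3['Time'])

df3 = df3.groupby(['Time']).count().reset_index()

# Smooth the counts with a 10-point rolling mean to make the chart more readable
df3['nb_Accident'] = df3['nb_Accident'].rolling(10).mean()

fig5 = px.line(df3, x='Time', y='nb_Accident', title='Number of Accidents Over Time')
# Show only the time of day on the x axis (the parsed values carry a dummy date)
fig5.update_xaxes(tickformat='%H:%M')

fig5.show()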
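Grouping by the exact time gives one point per distinct timestamp, and rolling(10) leaves the first nine points empty. As a sketch of another possible approach (not the one used for the chart above), the accidents can be counted per hour and smoothed with a centred window. The times below are made up, and the code assumes every value has the form H:MM:SS or HH:MM:SS.

```python
import pandas as pd

# Hypothetical accident times, written like the Time column of our dataset
toy_times = pd.Series(
    ["17:02:00", "1:06:00", "1:45:00", "9:30:00", "17:59:00", "17:10:00"]
)

# The hour is the text before the first colon
hours = toy_times.str.split(":").str[0].astype(int)

# Accidents per hour, with hours that have no accident filled with 0
per_hour = hours.value_counts().reindex(range(24), fill_value=0)

# Centred 3-hour rolling mean; min_periods=1 avoids empty values at the edges
smoothed = per_hour.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed)
```

The window size of 3 is an arbitrary choice for the example; a wider window gives a smoother but less detailed curve.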

Graph 6: Distribution of accidents by light conditions¶

What we wished to analyse with this graph was the possible effect of light on accidents. We expected many accidents to happen in darkness because of reduced visibility. However, most accidents happened in daylight, which still makes sense because traffic is heavier during that period. We were nevertheless very surprised that most of the accidents that happened in darkness occurred on roads where the lights were lit. We think this is because many roads are equipped with street lights.¶
In [ ]:
fig6 = px.pie(df, names='Light_conditions', color_discrete_sequence=px.colors.sequential.RdBu)
fig6.show()

Graph 7: Vehicle Type and Driving Experience¶

We analysed the types of vehicles most often involved in accidents, then looked at the driving experience of their drivers. This allows us to find the most 'dangerous' means of transport and their most 'dangerous' types of drivers (in terms of driving experience, which is based on the date they obtained their driving licence, if they have one at all). We can observe that the largest group of accidents involves automobile drivers with 5 to 10 years of driving experience.¶
In [ ]:
fig7 = px.treemap(df, path=['Type_of_vehicle', 'Driving_experience'], title='Vehicle Type and Driving Experience')
fig7.show()

Graph 8: Weather and Road Conditions¶

We wanted to check the number of accidents based on the weather, the type of road and its condition. From our results, the majority of accidents happen on dry asphalt roads in normal weather. This is partly explained by the fact that most roads are made of asphalt and that conditions in India are mostly dry. The results might have been very different for a northern country such as Sweden.¶
In [ ]:
fig8 = px.sunburst(df, path=['Weather_conditions', 'Road_surface_type', 'Road_surface_conditions'], title='Weather and Road Conditions')
fig8.show()

Graph 9: Junction and Collision Types¶

We tried to find a possible link between the junction type and the collision type. This would make sense, as accidents are likely to happen the same way in the same settings. We also added buttons on the left to sort the junction types by the number of accidents of a given collision type, so that we can see which junction type is most associated with it. From what we observed, Y-shape junctions seem to be the most dangerous (probably because of a priority issue between drivers) and, unsurprisingly, most accidents are vehicle-to-vehicle collisions.¶
In [ ]:
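fig9 = px.bar(df, x='Types_of_Junction', color='Type_of_collision', title='Junction and Collision Types')
# Number of accidents for each (collision type, junction type) pair
counts = df.groupby(['Type_of_collision', 'Types_of_Junction']).size()
# One button per collision type: reorder the junctions by their count for that type
buttons = [dict(label=collision, method='relayout',
                args=[{'xaxis.categoryorder': 'array',
                       'xaxis.categoryarray': counts.loc[collision].sort_values(ascending=False).index.tolist()}])
           for collision in df['Type_of_collision'].unique()]
fig9.update_layout(updatemenus=[dict(type='buttons', showactive=True, buttons=buttons)])
fig9.show()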
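Raw counts favour the junction types that are simply the most common, so a junction can look dangerous only because many accidents are recorded there. As a complementary check, the sketch below (made-up rows; the junction and collision labels are only illustrative) computes, for each junction type, the share of each collision type with a row-normalised crosstab.

```python
import pandas as pd

# Toy stand-in for the accident table (hypothetical rows)
toy = pd.DataFrame({
    "Types_of_Junction": ["Y Shape", "Y Shape", "Y Shape",
                          "No junction", "No junction", "Crossing"],
    "Type_of_collision": ["Vehicle with vehicle collision",
                          "Vehicle with vehicle collision",
                          "Collision with roadside objects",
                          "Vehicle with vehicle collision",
                          "Collision with roadside-parked vehicles",
                          "Vehicle with vehicle collision"],
})

# Each row sums to 1: distribution of collision types within a junction type
shares = pd.crosstab(
    toy["Types_of_Junction"], toy["Type_of_collision"], normalize="index"
)
print(shares.round(2))
```

Comparing rows of this table shows whether a collision type is over-represented at a given junction type, independently of how many accidents that junction type has in total.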

Graph 10: Accidents by areas¶

For our last analysis, we decided to look at the number of accidents by area. We thought this would help us form hypotheses about the possible causes of those accidents. For example, we expected more accidents in areas close to schools or workplaces on weekdays. However, although part of our hypotheses held true, the results are hardly reliable: some accidents recorded in school areas happened at midnight (not because of school traffic but only because they happened to occur near a school). We concluded that the location of an accident is hardly a reliable indicator of its cause, unless accidents occur often in a given type of area (because of the types of junctions expected there).¶
In [ ]:
df['Time'] = pd.to_datetime(df['Time']).dt.hour

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df['Day_of_week'] = pd.Categorical(df['Day_of_week'], categories=day_order, ordered=True)

df = df.sort_values('Day_of_week')
df.dropna(inplace=True)
In [ ]:
df10 = df[['Area_accident_occured', 'Time']].value_counts().reset_index()
df10.columns = ['Area_accident_occured', 'Time', 'Count']
df10 = pd.merge(df10, df, on=['Area_accident_occured', 'Time'], how='left')
df10.sort_values('Area_accident_occured', inplace=True)

fig = px.scatter(df10, x="Time", y='Count', animation_frame="Day_of_week", animation_group="Area_accident_occured",
                 size="Count", color="Area_accident_occured", hover_name="Area_accident_occured",
                 labels={"Time": "Time of Day", "Count": "Accident Count", "Area_accident_occured": "Area of Accident"})

fig.show()
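In the cell above, Count is computed over the whole week before the merge, so each animation frame shows weekly totals for the area and hour combinations present on that day. If per-day counts are wanted instead, one possible approach is to include the day in the grouping. The sketch below uses made-up rows (the area labels and hours are hypothetical) and is an alternative to, not a description of, the figure above.

```python
import pandas as pd

# Toy stand-in for the accident table, with Time already reduced to the hour
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Monday", "Monday", "Friday", "Friday"],
    "Area_accident_occured": ["School areas", "School areas", "Office areas",
                              "School areas", "Office areas"],
    "Time": [8, 8, 17, 0, 17],
})

# One row per (day, area, hour) with the number of accidents on that day
per_day = (
    toy.groupby(["Day_of_week", "Area_accident_occured", "Time"])
       .size()
       .reset_index(name="Count")
)
print(per_day)
```

When Day_of_week is an ordered categorical, as in our notebook, passing observed=True to groupby keeps only the combinations that actually occur.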